#### "Video Tracking" using Profile Guided Dataflow Transformation

Andrew Wallace, Institute for Sensors, Signals and Systems Heriot-Watt University

... being a discussion of some of the activity on

**EP/K009931/1:** Programmable embedded platforms for remote and compute intensive image processing applications, 2013-2017 ('RATHLIN')

Greg Michaelson, Rob Stewart, Deepayan Bhowmik, Nathanel Lemessa Baisa, HWU, Roger Woods, Fahad Siddiqui, Colm Kelly and Burak Bardak, QUB (with INSA, Clermont-Ferrand, EPFLausanne, Thales, Xilinx)







## Objectives of the EPSRC Programme

- Use a "model of computation" dataflow process network (DPN)
  representation which will allow the processing and data
  organisation needs of image processing/analysis (IP/A) to be
  readily captured.
- Develop a domain specific <u>Image Processing Processor</u> (IPPro) processor architecture
- Develop code translation and transformation techniques that will allow efficient implementation on a variety of platforms (e.g. Multicore CPU, FPGA, GPU, IpPro)
- Develop a Domain Specific Language, RIPL, 'above' the DPN, for ease of use by practitioners
- Evaluate using a set of prototypical and novel IP/A algorithms
   expressed as application specific DPNs

# The Target: a Distributed, Heterogeneous Architecture

#### Research Challenge

- Data intensive image processing applications e.g. Video Analytics, Surveillance, Smart Cameras and other sensors
- Option of distributed front-end image processing to reduce communication (and other costs) of backend processing

#### Research Methodology

- Distributed computing
- Data or control level parallelism (DLP)
- Programmability and Performance
- Single/Multiple Instruction Multiple Data (S/MIMD)

#### Processor based Architecture

- Programmability
- Scalability
- Flexibility
- Efficient Resource Utilization



e.g. Xilínx-7: FPGA hardware plus ARM







## What kind of Dynamic Imaging?

Multi-target tracking, either from a CCTV network, or from a mobile vehicle or vehicles (Sensor – Region – Algorithm Utility)





S Matzka, AM Wallace and YR Petillot, "Efficient Resource Allocation for Automotive Attentive Vision Systems", IEEE Transactions on Intelligent Transportation Systems, 859-872, 2012.







## What kind of Dynamic Imaging?

Multi-target tracking, either from a CCTV network, or from a mobile vehicle or vehicles



W Limprasert, AM Wallace and G Michaelson. "Real-time People Tracking in a Camera Network", IEEE Journal on Circuits and Systems, 263-271 June 2013







#### What's the problem?

Code development for parallel or heterogeneous architectures .... e.g. it took several months to hand-craft GPU code to detect, track and associate 5-10 subjects with live video with two cameras at 10fps (40fps on recorded video)

TABLE 5 GPU acceleration

| Function       | CPU      | GPU      | SpeedUp |
|----------------|----------|----------|---------|
| Turctori       | time(ms) | time(ms) | ratio   |
| Detection      | 48.81    | 8.43     | 5.8     |
| Likelihood     | 30.95    | 13.60    | 2.3     |
| Fusion         | _        | 0.09     | -       |
| Resampling     | 0.44     | 0.56     | 0.8     |
| Transition     | 2.11     | 0.37     | 5.7     |
| Remove         | _        | 0.02     | -       |
| Add            | 0.02     | 0.31     | 0.1     |
| Texture Update | _        | 0.30     | _       |
| $Total^1$      | 82.3     | 23.2     | 3.5     |





W Limprasert, AM Wallace and G Michaelson. "Real-time People Tracking in a Camera Network", IEEE Journal on Circuits and Systems, 263-271 June 2013

# What kind of Dynamic Imaging? Sensing Forests: Multispectral LiDAR

Using Multi- or Hyper-spectral Lidar, it is possible to sense a single footprint, or build a 3D image of the scene below. This presents challenges to spectrally unmix pixels and images, such that structure, materials and material variation can be inferred.







AM Wallace, A McCarthy, C Nichol, X Ren, S. Morak<sup>2</sup>, D Martinez-Ramirez, I. H. Woodhouse and GS Buller, "Design and of Evaluation of Multi-spectral LiDAR for the Recovery of Arboreal Parameters" IEEE Transactions on Geoscience and Remote Sensing, 52(8), 4942-4954, 2014

## What's the problem?

Code development for parallel or heterogeneous architectures .... e.g. it took several months (and inevitable algorithmic changes) to hand-craft Beowulf code to analyse (RJMCMC) full waveform multispectral LiDAR for tree canopy data, to recover structure and physiology.



Single footprint data



Fig. 17. Final fitting result of the real data (shown in Fig. 4) using DP SSD-RJMCMC.

J Ye, AM Wallace A Al Zain and J Thompson, Parallel Bayesian Inference of Range and Reflectance from LaDAR Profiles, Journal of Parallel and Distributed Computing, 73(4), 383–399, 2013.





# What's the problem? Simple HOG on an IpPro

Direct transformation to a custom FPGA, or (better) direct FPGA Coding of HOG using the IpPro (QUB) is laborious, and not necessarily optimal.



Fahad Manzoor Siddiqui, Matthew Russell, Burak Bardak, Roger Woods, Karen Rafferty IPPro: FPGA based Image Processing Processor, Proc GlobalSIP Conference 2014

#### Wouldn't it be nice if .......

- We could express any given algorithm as high level abstractions, drastically reducing code development time, yet
- easily translate that code into executable code for a variety of parallel architectures, and
- transform that code (using either analysis or profiling) to optimise "performance", e.g.
  - √ ... for speed, memory use, power consumption, cost, and
- either use a single platform, or mix and match processors (e.g. CPU, FPGA, GPU, IpPro) to meet the desired objectives, yet
- match or even better "hand-crafted" code







## Algorithm Development: the Rathlin Model



#### Compiler Flow for FPGA route: RIPL to CAL

- Inline all function calls into the main function.
- Replace all RIPL type declarations to CAL array declarations.
- Generate stream-based actors for each use of a RIPL iterator.
- Derive dataflow wires from implicit data dependencies between RIPL variables.
- Generate CAL files for each actor, where there is one actor per RIPL iterator.
- Generate an XML/XDF file for the wire connections.







FPGA hardware description



# Compiler Flow: CAL to Verilog

The Orcc frontend parses CAL syntax into an abstract syntax tree (AST) which is then mapped to a dataflow IR.

The Xronos backend of the Orcc compiler generates Verilog for each actor, and a VHDL file that describes the network of wires between actors.

It does this by compiling the dataflow IR to a language independent model (LIM) IR which abstracts FPGA hardware.

OpenForge, open sourced by Xilinx, is used to compile LIM IR to Verilog.









# Why DPNs?

- Image processing and analysis algorithms can be classified and developed in broad categories based on their algorithmic description:
  - point, local, global, temporal, adaptive or random.
- These classifications can be used to understand the hardware requirements and memory estimations.
- Early exposure to these requirements in dataflow representations can be used in optimisation, resource allocation and code synthesis.
- There is an established and active community working around the open source ORCC tools







## Mean Shift Algorithm – Exemplar



Originally using a fixed template and a colour model with an Epanechnikov Kernel, this algorithm has been used many times (>8000 citations) and has been adapted in many ways for image segmentation and Object Tracking. For our purposes, we have several previous language (Matlab, C, Hume, Renesas) codes, there are a number of published hand-crafted FPGA implementations we can compare against, and it is challenging, because of the optimisation loop and necessary precision, but not impossible for the IpPro.

Comaniciu and Meer, IEEE Trans PAMI 24(5), Mean Shift: a robust approach towards feature space analysis, pp603-618, 2002





## DPN Exemplar: Mean Shift Algorithm for Tracking

The basic approach is to create a probability density function (PDF) in frame (n+1), based on the colour histogram in frame (n) or a fixed model, and use an iterative procedure to find the maximum in this PDF that defines the new position of the object being tracked.

Usually, the PDF is based on the similarity between centre-weighted colour histograms, using an Epanechnikov kernel; the similarity function is usually defined from the Bhattacharya distance.

No. of bins in histogram

Bhattacharya distance

$$\rho[\hat{p}(y), q] = \sum_{u=1}^{m} \sqrt{\hat{p}_{u}(y)q_{u}}$$

Target colour histogram at position y







Model colour histogram

# DPN Exemplar: Mean Shift Algorithm for Tracking

The kernel is recursively moved from the starting position in the previous frame  $\hat{y}_0$  to a new position  $\hat{y}_1$  until convergence









# Mean shift tracking algorithm

```
Given object position y_0 in frame n
Compute Epanechnikov kernel
Compute object colour model, q_u(y_0)
Repeat
   Read next frame (n+1)
   Compute object candidate model p_u(y_0)
   Compute similarity function, p(y)
   Repeat
           Derive weights w_i for each pixel in candidate window
           Compute new candidate position, y_1
          Evaluate similarity function, p(y)
   Until |y<sub>1</sub>-y<sub>0</sub>| < € (near zero) or oscillatory or limit
Until end of sequence
```







# What does the CAL Program look like (1)?

First, there is a Dataflow Process Network consisting of several actors – normally encoded with a XDF file that defines the connectivity and parameter passing between the several actors.



See ORCC: Dataflow Programming made easy - <a href="http://orcc.sourceforge.net/">http://orcc.sourceforge.net/</a>







# What does the CAL Program look like (2)?

Second, there are CAL statements within each actor that define its function (e.g. Centre XY). bool while\_loop\_status := false;

updateCentreXY: action ==>

centre x := centre x + dx;

centre y := centre y + dy;

```
package main;
import std.header.Parameter.*;
actor updateCentreXY() int dx i, int dy i ==>
        uint centre x out, uint centre y out, bool loop status:
    uint centre x ;
    uint centre y ;
    int dx;
    int dy;
   int loopcount := 0;
                                    SO
    initialise: action ==>
        centre x := CENTRE X;
        centre y := CENTRE Y;
        loopcount := 0;
    end
    get_dx_dy: action dx_i:[val_x], dy_i:[val_y] ==>
        dx := val x;
        dy := val y;
    end
```

This Actor has four actions scheduled by

a FSM



```
S2
                loopcount := loopcount + 1;
                if(((dx=0) && (dy=0)) || (loopcount>20) ) then
                     while loop status := false;
                else
                     while loop status := true;
                end
            end
             send: action ==> centre x out:[val x],
                             centre y out:[val y], loop status:[val_loop]
            var
                uint val x,
                uint val y,
                bool val loop
                                                   S3
                val x := centre x;
                val y := centre y;
                val loop := while loop status;
            end
            schedule fsm s0 :
                s0 (initialise ) --> s1;
                s1 (get_dx_dy ) --> s2;
                                                                   FSM
                s2 (updateCentreXY ) --> s3;
                 s3 (send ) --> s1;
            end
Engineerin
Research Council
```

# Original Version: FPGA Synthesis of each Actor

|                   | Slice LUT | Slice registers | Block RAM | DSP48E | FMax   |
|-------------------|-----------|-----------------|-----------|--------|--------|
|                   |           |                 | /FIFO     |        | (MHz)  |
| Naive             | 3664      | 8777            | 88        | 49     | 55.41  |
| Final_XY          | 76        | 80              | 0         | 0      | 721.48 |
| Centre_XY         | 182       | 199             | 0         | 0      | 530.81 |
| Stream_to_YUV     | 90        | 287             | 24        | 0      | 420.07 |
| update_model      | 1042      | 2399            | 30        | 0      | 148.74 |
| YUV2RGB           | 300       | 957             | 7         | 0      | 126.71 |
| displacement      | 545       | 1326            | 2         | 9      | 73.40  |
| update_weight     | 556       | 1544            | 14        | 4      | 66.46  |
| kArray_derv       | 437       | 1074            | 1         | 18     | 55.44  |
| kArray_evaluation | 460       | 1148            | 1         | 18     | 55.41  |







# **DPN: Applying Transformations**

|   | Transformation      | Description                                             |
|---|---------------------|---------------------------------------------------------|
|   | Transformation      |                                                         |
| 1 | Actor fusion        | Combines multiple actors into a single actor.           |
| 2 | Actor fission       | Load balance data between replicas of an actor.         |
| 3 | Loop fission        | Load balance data between replicas of a loop.           |
| 4 | Actor pipelining    | Pipelines an actor's instructions into separate actors. |
| 5 | Task parallelism    | Decomposes an expression into separate sub-expression.  |
| 6 | Loop elimination    | Replaces loops with on-the-fly streaming.               |
| 7 | FSM simplification  | Re-writes FSMs so that all states are always live.      |
| 8 | Built-in constructs | Use constructs with optimised FPGA implementations.     |

Other computational transformations are considered for FPGAs, notably Floating vs Fixed point implementation and the use of LUTs







# Example 1: Data Parallelism & Actor Fission

#### **Transformation 2** Data parallelism with actor fission



 $ACTOR A In \Longrightarrow Out :$  action In: [x]  $\Longrightarrow Out: [x]$ 

ACTORB In  $\Longrightarrow$  Out : action In: [x]  $\Longrightarrow$  Out: [F x]

ACTORC In $\Longrightarrow$ Out: action In:[x] $\Longrightarrow$ Out:[x]

 $\mathcal{ACTOR} \land \land \Rightarrow 01,02 :$ action In: [x1,x2]  $\Longrightarrow 01:$  [x1], 02: [x2]

 $\mathcal{ACTOR} B_1 A \Longrightarrow Out :$ action  $In: [x] \Longrightarrow Out: [\mathcal{F} x]$ 

 $\mathcal{ACTOR} B_2 A \Longrightarrow Out :$ action In: [x]  $\Longrightarrow Out: [\mathcal{F} x]$ 

 $\mathcal{ACTORC} \land \Longrightarrow \mathsf{Out}:$ action In: [x]  $\Longrightarrow \mathsf{Out}:$  [x]







# Actor Fission: applied to update weights

```
Update weights: action ==>
Do
/* data parallelisable */
Foreach int I in 0 .... (NUMBINS) -1 do
    If (Pu model buffer[i] = 0) then
           R[i] := 0:
    Else
           sqrt ((Qu_model_buffer[i]/Pu_model_buffer[i])));
           R[i] := sqrtvalue;
    End
End
/* barrier necessary between two loops – not task parallelisable */
/* data parallelisable, but not cost-effective */
For each int x in 0 .... (X_SIZE-1) do
    Foreach int y in 0 .... (Y SIZE-1) do
           weight_buffer[x][y] := R[bin_buffer[x][y]];
    End
End
End
```







# Actor Fission: Update weight (Mean Shift)



(a) before









# Example 2: Task Parallelism

#### **Transformation 5** Decompose expression into task parallel actors









# Task Parallelism: Displacement (Mean Shift)









# Applying Transformations to Mean Shift

| Functionality        | Transformation          | Registers | Slice LUTs | BRAM | DSP | Clock (MHz |
|----------------------|-------------------------|-----------|------------|------|-----|------------|
| Stream to YUV        | None                    | 90        | 287        | 24   | 0   | 420.       |
| Stream to 10 v       | Loop elimination        | 27        | 85         | 0    | 0   | 386.       |
| YUV to RGB           | None                    | 300       | 957        | 7    | 0   | 126        |
| TOV to KGB           | Actor fusion            | 99        | 353        | 0    | 0   | 182        |
| Displacement         | None                    | 545       | 1326       | 2    | 9   | 73         |
| Displacement         | Task parallelism        | 791       | 1210       | 7    | 9   | 110        |
|                      | None                    | 556       | 1544       | 14   | 4   | 66         |
| Update weight        | Fission                 | 12352     | 19878      | 55   | 128 | 72         |
|                      | Just square root (none) | 346       | 548        | 0    | 4   | 72         |
|                      | Square root Lookup      | 139       | 227        | 32   | 0   | 368        |
| ecall this is        | Combined                | 7907      | 38544      | 1028 | 0   | 225        |
| only' k-array derive | None                    | 437       | 1074       | 1    | 18  | 55         |
| A array derive       | Loop promotion          | 4447      | 12484      | 5    | 144 | 52         |







#### Final results: Mean Shift

|                                                                               | Slice LUT               | Registers             | BRAM             | DSP48E        | Clock (MHz)           |
|-------------------------------------------------------------------------------|-------------------------|-----------------------|------------------|---------------|-----------------------|
| Naive version                                                                 | 2751                    | 6582                  | 86               | 13            | 66.5                  |
| Optimising HDL for speed<br>Optimising HDL for area<br>Dataflow optimisations | $2751 \\ 2748 \\ 10786$ | 6635<br>6610<br>51267 | 86<br>87<br>1026 | 13<br>13<br>9 | 66.5<br>66.5<br>110.0 |

Table 4: Comparison of Dataflow and HDL Level Optimisation Results

| Functionality                                                                         | Transformation                                                                             | Runtime                                   | FPS                                  |
|---------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|-------------------------------------------|--------------------------------------|
| Naive version                                                                         |                                                                                            | 2.97s                                     | 43.8                                 |
| Updating the model Displacement Compute tracking window RGB to YUV k-array evaluation | FSM simplification<br>Task parallelism<br>Loop elimination<br>Actor fusion<br>Language use | 2.06s<br>3.17s<br>2.96s<br>2.67s<br>2.34s | 63.1<br>41.0<br>43.9<br>48.7<br>55.6 |
| Combined                                                                              |                                                                                            | 1.70s                                     | 76.5                                 |

Table 5: Transformation effects on CPU results







# Conclusions (the story so far)

- We have applied DPN transformations to optimise algorithms expressed in the CAL dataflow language.
- □ This identifies transformations to target FPGAs; e.g. for Mean Shift, the overall clock frequency is increased from 66.5MHz to 110MHz.
- Applying all CPU targeting transformations increases mean throughput from 43fps to 77fps.
- In general, coding is (arguably) much simplified, e.g. a wavelet transformation is 4 lines of RIPL, 34 lines of CAL, and over 1000 lines of VHDL code.
- We have also developed an IpPro architecture, a partially reconfigurable soft core processor that will continue to evolve







#### **Future Work**

- A key priority is to embed the dataflow transformations as compiler optimisations guided by FPGA simulation and CPU traced-based profiling
- As the project develops, we hope to target the IpPro from both RIPL and Dataflow networks.
- We are developing concurrently new algorithms for dynamic video data analysis,
  - Random Finite Set approach to track multiple targets of two distinct types in clutter (e.g. pedestrians and vehicles, sheep and goats)
  - Crowd density and flow estimation techniques, that we hope to use to improve detection and tracking in sparse and dense populations

R. Stewart, D. Bhowmik, A Wallace, G Michaelson, Profile guided dataflow transformation for FPGAs & CPUs, IEEE Global Conference on Signal and Information Processing, December 2014 + Journal Submit.





